Abstract

The receiver operating characteristic (ROC) curve is a very useful tool for
analyzing the diagnostic/classification power of instruments/classification
schemes as long as a binary-scale gold standard is available. When the gold
standard is continuous and there is no confirmative threshold, the ROC curve
becomes less useful. Hence, several extensions have been proposed for
evaluating the diagnostic potential of variables of interest. However, due to
the computational difficulties of these nonparametric extensions, they are
not easy to use for finding the optimal combination of variables that
improves on the individual diagnostic power. Therefore, we propose a new
measure, which extends the AUC index, for identifying variables with good
potential to be used in a diagnostic scheme. In addition, we propose a
threshold gradient descent based algorithm for finding the best linear
combination of variables that maximizes this new measure, which is applicable
even when the number of variables is huge. The estimate of the proposed index
and its asymptotic properties are studied. The performance of the proposed
method is illustrated using both synthesized and real data sets.

Keywords: ROC curve. Area under curve. Gold standard. Classification 



Corresponding author: ycchang@stat.sinica.edu.tw



Preprint submitted to Computational Statistics and Data Analysis



January 20, 2013 



1. Introduction 



The ROC curve, founded on a binary gold standard, is one of the most
important tools to measure the diagnostic power of a variable or classifier,
and its importance has been intensively studied by many authors, which
can easily be found in the literature and textbooks such as Pepe (2003) and
Krzanowski and Hand (2009). Moreover, when the number of variables is
huge, many algorithms have been proposed for finding the best combination
of variables to increase the individual classification accuracy (Su and Liu
(1993), Pepe (2003), Ma and Huang (2005), and Wang et al. (2007a)).
However, in many classification or diagnostic problems, the professed binary
gold standard is essentially derived from a continuous-valued variable. If
there is no confirmative threshold for the continuous gold standard, then the
evaluation of variables/classifiers according to ROC curve based analysis
may vary as the choice of threshold changes and therefore becomes
less informative. For example, glycosylated hemoglobin is usually used as a
primary diabetic control index, and is originally measured as a
continuous-valued variable. Health institutes, such as the World Health
Organization and National Institutes of Health (NIH), suggest a cutting point
for it based on current findings for diabetic diagnosis and control. Once its
cutting point is fixed, the association between the variables of interest,
such as new drugs, and this binary-scale standard can be evaluated using some
ROC curve related analysis methods. However, as advances are made in science
and medicine about this disease, this criterion will be re-evaluated and
revised as necessary. Then, the performance evaluation of variables/classifiers
may vary as the binary-recoding scheme is changed. It is clear that an
unwarranted performance measure may result in misleading conclusions and may
require re-evaluation of all the available diagnostic methods every time
a new standard is proposed. Hence, a measure that directly connects to the
continuous gold standard is always preferred, which motivates our study of a
new measure when the gold standard is continuous. Our goal in this paper is
to find a robust measure, which is not affected by the choice of cutting point
of a gold standard or how the binary outcome is derived from a continuous
gold standard.

Although there are a lot of reports about the ROC curve, there is still a
lack of study when the gold standard is not binary (Krzanowski and Hand,
2009). Henkelman et al. (1990) proposed a maximum likelihood
method under an ordinal-scale gold standard. Recently, Zhou et al. (2005),
Choi et al. (2006), and Wang et al. (2007b) considered the ROC curve
estimation problems based on some nonparametric and Bayesian approaches,
when there is no gold standard. In addition, some ROC-type analysis
without a binary gold standard has been considered in Obuchowski (2005) and
Obuchowski (2006), where a nonparametric method is used to construct a
new measure, and many other applications with a continuous gold standard
are discussed. However, these approaches, due to computational issues, are
not easy to apply to the case that the optimal combination of variables is
of interest, especially when the number of variables is large as in modern
biological/genetic related studies (Waikar et al. (2009)).



In this paper, an extension of the AUC-type measure is proposed, which
is independent of the choice of threshold of the continuous gold standard, and
algorithms for finding the best linear combination of variables that maximizes
the proposed measure are studied. Under the joint multivariate normality
assumption, the linear combination can be found using the LARS method. When
this joint normality assumption is violated, we propose a threshold gradient
descent based method (TGDM) to find the optimal linear combination. Thus, our
algorithms also inherit the nice properties of LARS and TGDM when dealing with
high dimensional and variable selection problems. Numerical studies are
conducted to evaluate the performances of the proposed methods with different
ranges of cutting points using both synthesized and real data sets. The
estimate of this novel measure and its asymptotic properties are also
presented.

In the next section, we first present a novel measure for evaluating the
diagnostic potential of individual variables and then an estimate of this
measure. The algorithms for finding the best linear combination are discussed
in Section 3. Numerical results based on the synthesized data and some real
examples follow. A summary and conclusions are given in Section 4. The
technical details are presented in the Appendix.



2. An AUC-type Measure with a Continuous Gold Standard 

Before introducing a novel AUC-type measure based on a continuous gold
standard, we first fix the notation and briefly review the definition of the ROC
curve and its related measures. Let $Z$ and $Y$ be two continuous real-valued
random variables, where $Z$ denotes the gold standard and $Y$ is a variable
of interest with diagnostic potential to be measured. For example, $Z$
is a primary index for measuring a disease and $Y$ is some other measure of
subjects that is related to the disease of interest. In some medical diagnostics,
the primary index is difficult to measure, and we are usually looking for
variables that are strongly associated with $Z$ and easy to measure, to be used
as surrogates. That is why we need to evaluate the "level of association"
of $Y$ to $Z$. Likewise, in some bioinformatics studies, in order to develop
new treatments, we would like to identify any strong associations between
some genomic related factors $Y$ and the continuous gold standard $Z$. Suppose
that there is an unambiguous threshold $c$ of $Z$ that can be used to classify
subjects into two subgroups, and assume further that subjects with $Z > c$
are classified as diseased, and otherwise as members of the control group.
Then the ROC curve, for such a given $c$, is defined as
$\mathrm{ROC}(t) = S_D(S_C^{-1}(t))$, where $S_D(t) = P(Y > t \mid Z > c)$ and
$S_C(t) = P(Y > t \mid Z \le c)$, and the AUC of variable $Y$ is defined as

$$\mathrm{AUC}(c) = P(Y^{+} > Y^{-}), \tag{1}$$

where random variables $Y^{+}$ and $Y^{-}$ respectively denote the $Y$-value of
subjects of the disease and non-disease groups with density functions
$f(y \mid Z > c)$ and $f(y \mid Z \le c)$. That is, $Y^{+}$ and $Y^{-}$ are random
variables for the sub-populations defined by $\{Z > c\}$ and $\{Z \le c\}$,
respectively. It is clear that the $\mathrm{AUC}(c)$ defined in (1) is a function
of $c$, which will change as the threshold $c$ of $Z$ varies. Hence, when the
threshold is dubious, using $\mathrm{AUC}(c)$ as a measure may misjudge the
diagnostic power of $Y$ or the level of association between $Y$ and $Z$.

Let $f_c(t)$ be a probability density function defined on the range of possible
values of $c$, then $\mathrm{AUC}_I$ is defined as

$$\mathrm{AUC}_I = \int \mathrm{AUC}(t) f_c(t)\, dt. \tag{2}$$

Hence, by its definition, the proposed $\mathrm{AUC}_I$ is independent of the
choice of cutting point for the continuous gold standard, and is invariant under
any monotonically increasing transformation of $Y$ as well. This kind of
threshold independent property is also one of the important properties of the
ROC curve and AUC when used as measures of diagnostic performance. Since
$\mathrm{AUC}_I$ is defined as an integration of $\mathrm{AUC}(c)$ over the range
of possible cutting points with respect to a weight function $f_c(t)$, the
support of $f_c(t)$ should be chosen as a subset of the support of the density of
$Z$. Moreover, we can use $f_c(t)$ to put different weights on all possible
cutting points of $Z$ if there is some information about the possible cutting
point. If $Z$ is an ordinal discrete variable, then there are only countably
many cutting points, $f_c(t)$ can be chosen as a probability mass function on all
possible cutting points, and the integration in (2) becomes

$$\mathrm{AUC}_I = \sum_{t_i \in C} \mathrm{AUC}(t_i) f_c(t_i), \tag{3}$$

where $C$ is the set of all possible cutting points. In particular, when $Z$ is
binary, we can let $f_c(t)$ be a degenerate probability density, and then
$\mathrm{AUC}_I$ is the same as the original AUC.

2.1. Estimate of $\mathrm{AUC}_I$

Let random variables $(Y_i, Z_i)$ denote a pair of measures from subject $i$,
for $i \ge 1$. Suppose that $\{(y_i, z_i), i = 1, \ldots, n\}$ are $n$ independent
observed values of random variables $(Y_i, Z_i)$, $i = 1, \ldots, n$. For a given
cutting point $c$, a subject $i$, $i = 1, \ldots, n$, is assigned as a "case" if
$z_i > c$ and otherwise labeled as a "control". That is, for a given $c$, we
divide the observed subjects into two groups; let $S_1(c)$ and $S_0(c)$ be the
case and control groups with sample sizes $n_1$ and $n_0$, respectively. It is
obvious that these assignments depend on the choice of $c$. Then for a fixed $c$,
the empirical estimate of $\mathrm{AUC}(c)$ is defined as

$$\hat{A}(c) = \frac{1}{n_1 n_0} \sum_{i \in S_1(c);\, j \in S_0(c)} \psi(y_i - y_j), \tag{4}$$

where $\psi(u) = 1$ if $u > 0$; $= 0.5$ if $u = 0$; and $= 0$ if $u < 0$. (It is
easy to see that $\hat{A}(c)$ does not exist if either
$c > \max\{z_i, i = 1, \ldots, n\}$ or $c < \min\{z_i, i = 1, \ldots, n\}$, since
for these two cases we have either $n_1 = 0$ or $n_0 = 0$. Therefore, in this
paper, we assume $\hat{A}(c) = 0.5$ when either one of the cases occurs.)
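
The empirical estimate in (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `psi` and `empirical_auc` and the toy data are hypothetical, and the code is not taken from the authors' software.

```python
def psi(u):
    """Kernel of equation (4): 1 if u > 0, 0.5 if u == 0, 0 if u < 0."""
    if u > 0:
        return 1.0
    if u == 0:
        return 0.5
    return 0.0


def empirical_auc(y, z, c):
    """Empirical AUC(c): subjects with z > c are cases, the rest controls."""
    cases = [yi for yi, zi in zip(y, z) if zi > c]
    controls = [yi for yi, zi in zip(y, z) if zi <= c]
    if not cases or not controls:
        # Convention stated in the text: use 0.5 when one group is empty.
        return 0.5
    total = sum(psi(yi - yj) for yi in cases for yj in controls)
    return total / (len(cases) * len(controls))


# Toy data (hypothetical): y is the marker, z the continuous gold standard.
y = [0.2, 1.1, 0.9, 2.3, 1.8]
z = [0.1, 0.8, 1.2, 2.0, 2.5]
print(empirical_auc(y, z, c=1.0))   # 5 of the 6 case/control pairs are concordant
print(empirical_auc(y, z, c=10.0))  # no cases, so the 0.5 convention applies
```

Changing `c` changes which subjects count as cases, which is exactly why the value of the estimate depends on the cutting point.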

If the whole support of $Z$ is considered as a possible range of cutting
points, then a natural estimate of $\mathrm{AUC}_I$ can be defined as

$$\hat{A}_I = \int \hat{A}(t)\, d\hat{F}_c(t), \tag{5}$$

where $\hat{F}_c(t)$ is the empirical estimate of the cumulative distribution
function of $Z$ based on $\{z_1, \ldots, z_n\}$. However, in practice, it is rare
to choose cutting points at ranges near the two ends of the distribution of $Z$.
Thus, instead of the whole range of $Z$, we might explicitly define a weight
function $f_c(t)$ on a particular critical range. Below, we demonstrate three
possible choices: (1) a uniform distribution over the range of
$(-\hat{\sigma}, +\hat{\sigma})$, where $\hat{\sigma}$ is an empirical
standard deviation of $Z$, say $f_1(t)$; (2) a normal density with sample mean
$\hat{\mu}$ and standard deviation $\hat{\sigma}$ based on the observed values of
$Z$, say $f_2(t)$; or (3) a kernel density estimate, say $f_3(t)$, to approximate
the marginal density of $Z$. For different weight functions $f_j(t)$,
$j = 1, 2, 3$, the estimate of $\mathrm{AUC}_I$ is denoted as

$$\hat{A}_{Ij} = \int \hat{A}(t) f_j(t)\, dt. \tag{6}$$

It is clear that our method can be extended to other reasonable choices of
weight functions. The theorem below states the strong consistency
of $\hat{A}_{Ij}$ for all $j$.

Theorem 2.1. Let $(Y \in R^1, Z \in R^1)$ be a pair of random variables with
uniformly continuous marginal densities. Assume that
$\{(y_1, z_1), \ldots, (y_n, z_n)\}$ are $n$ observations of the independent and
identically distributed random sample $(Y_i, Z_i)$, $i = 1, \ldots, n$. Assume
further that $Z$ is the continuous gold standard. Then for a given
$f_c(t) = f_j(t)$, $j = 1, 2, 3$, with probability one,
$\hat{A}_{Ij} - \mathrm{AUC}_{Ij} \to 0$ as $n \to \infty$, where $\hat{A}_{Ij}$
and $\mathrm{AUC}_{Ij}$ are defined as in (6) and (2), respectively, with
corresponding $f_c(t) = f_j(t)$.

Proof of Theorem 2.1. Since the bounded function $\hat{A}(c)$ converges almost
surely to $\mathrm{AUC}(c)$ for all given $c$ and $f_c(t)$ is also a bounded
density function, the proof of Theorem 2.1 follows from the dominated
convergence theorem.

It is difficult to have an explicit form for the variance of $\hat{A}_{Ij}$ due
to its integral form. Thus, a bootstrap estimate of the variance of
$\hat{A}_{Ij}$ is used and denoted as $\hat{V}(\hat{A}_{Ij})$. A similar idea is
employed in Obuchowski (2006).
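
A bootstrap variance of this kind might be computed as in the sketch below. It assumes a plain nonparametric bootstrap that resamples the $(y_i, z_i)$ pairs with replacement, which the text does not spell out; the helper names are hypothetical, and the default of 200 resamples simply mirrors the bootstrap sample size used in the numerical studies.

```python
import random
import statistics


def auc_i_empirical(y, z):
    """Compact version of estimate (5): average empirical AUC over cut points z_k."""
    total = 0.0
    for t in z:
        cases = [yi for yi, zi in zip(y, z) if zi > t]
        controls = [yi for yi, zi in zip(y, z) if zi <= t]
        if not cases or not controls:
            total += 0.5  # convention used in the text for an empty group
            continue
        score = sum(1.0 if a > b else 0.5 if a == b else 0.0
                    for a in cases for b in controls)
        total += score / (len(cases) * len(controls))
    return total / len(z)


def bootstrap_variance(y, z, estimator, n_boot=200, seed=0):
    """Sample variance of `estimator` over resamples of the (y_i, z_i) pairs."""
    rng = random.Random(seed)
    n = len(y)
    replicates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # draw pairs with replacement
        replicates.append(estimator([y[i] for i in idx], [z[i] for i in idx]))
    return statistics.variance(replicates)


# Toy data (hypothetical).
y = [0.2, 1.1, 0.9, 2.3, 1.8, 3.0, 2.6, 0.5]
z = [0.1, 0.8, 1.2, 2.0, 2.5, 3.1, 2.2, 0.9]
print(bootstrap_variance(y, z, auc_i_empirical))
```

Passing the estimator as a function keeps the resampling loop independent of the particular weight function, so the same routine could serve any of the three estimates.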

Remark 2.2. Note that the method for calculating (6) may depend on the
choice of weight function. If the empirical density of the gold standard is used,
then the computation is straightforward; if a kernel density of the gold
standard is used, then a numerical integration method is required. However,
in all cases the computation is easy since the density is one-dimensional.
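
The two computational routes mentioned in the remark can be illustrated as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, the midpoint rule stands in for whatever numerical integration is preferred, the weight is renormalised on the chosen interval, and centring the uniform range at the sample mean is an assumption made here for the example.

```python
import statistics


def empirical_auc(y, z, c):
    """Empirical AUC(c) of equation (4); 0.5 by convention if a group is empty."""
    cases = [yi for yi, zi in zip(y, z) if zi > c]
    controls = [yi for yi, zi in zip(y, z) if zi <= c]
    if not cases or not controls:
        return 0.5
    score = sum(1.0 if a > b else 0.5 if a == b else 0.0
                for a in cases for b in controls)
    return score / (len(cases) * len(controls))


def auc_i_empirical(y, z):
    """Equation (5): integrate against the empirical CDF, i.e. average over z_k."""
    return sum(empirical_auc(y, z, t) for t in z) / len(z)


def auc_i_weighted(y, z, weight, lo, hi, grid=400):
    """Equation (6) by the midpoint rule, with the weight renormalised on (lo, hi)."""
    step = (hi - lo) / grid
    mids = [lo + (k + 0.5) * step for k in range(grid)]
    w = [weight(t) for t in mids]
    return sum(wk * empirical_auc(y, z, t) for wk, t in zip(w, mids)) / sum(w)


# Toy data (hypothetical).
y = [0.2, 1.1, 0.9, 2.3, 1.8, 3.0]
z = [0.1, 0.8, 1.2, 2.0, 2.5, 3.1]
mu, sd = statistics.mean(z), statistics.stdev(z)
normal = statistics.NormalDist(mu, sd)

print(auc_i_empirical(y, z))                                       # empirical weights
print(auc_i_weighted(y, z, lambda t: 1.0, mu - sd, mu + sd))       # uniform weight
print(auc_i_weighted(y, z, normal.pdf, mu - 3 * sd, mu + 3 * sd))  # normal weight
```

With empirical weights the integral reduces to a finite average, while any smooth weight only needs a one-dimensional quadrature over the cutting points.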

3. Linear combination of variables that maximizes $\mathrm{AUC}_I$

For a classification or diagnostic problem, there are usually many variables
measured from each subject, and it is well known that a combination
of variables can usually improve on the classification performance of a single
variable. This situation motivates us to study how to find the optimal linear
combination of variables that maximizes the proposed measure
$\mathrm{AUC}_I$. For the classical AUC, Su and Liu (1993) studied the best
linear combination under a multivariate normal distribution assumption. Here we
extend their idea to $\mathrm{AUC}_I$. In addition, we also aim to address cases
with a huge number of variables, which usually involve some computational issues
and will be discussed later in this section.

3.1. Optimal Linear Combination of Variables Under Joint Normality 

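For clarity and convenience, we start with a bivariate normal distribution
case, since the linear combination of variables, for a given vector of
coefficients, can be treated as a single variable.

Let $U = (Y, Z)^T$ be a random vector following a bivariate normal
distribution with mean vector $\mu = (\mu_1, \mu_2)^T$ and covariance matrix

$$\Sigma_U = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$

Suppose that $U_i = (Y_i, Z_i)^T$, $i = 1, 2$, are two independent random vectors
generated from the same distribution of $U$. Define

$$Q_i = \exp\left\{-\tfrac{1}{2}(U_i - \mu)^T \Sigma_U^{-1} (U_i - \mu)\right\}, \quad i = 1, 2.$$

Then for a given $c$,

$$\mathrm{pr}(Y_1 > Y_2, Z_1 > c, Z_2 \le c) = \int_{-\infty}^{\infty} \int_{-\infty}^{y_1} \int_{c}^{\infty} \int_{-\infty}^{c} \frac{Q_1 Q_2}{4\pi^2 |\Sigma_U|}\, dz_2\, dz_1\, dy_2\, dy_1, \tag{7}$$

where $|\Sigma_U|$ denotes the determinant of matrix $\Sigma_U$. The conditional
distribution of $Y_j$ given $Z = z_j$ is a normal distribution with mean
$\tilde{\mu}_j = \mu_1 + (\sigma_1/\sigma_2)\rho(z_j - \mu_2)$, $j = 1, 2$, and
variance $\tilde{\sigma}^2 = (1 - \rho^2)\sigma_1^2$. Let
$\eta(z_1, z_2) = 1/(2\pi\sigma_2^2) \exp(-((z_1 - \mu_2)^2 + (z_2 - \mu_2)^2)/(2\sigma_2^2))$.
Then, (7) can be rewritten as

$$\begin{aligned}
\mathrm{pr}(Y_1 > Y_2, Z_1 > c, Z_2 \le c)
&= \int_{c}^{\infty} \int_{-\infty}^{c} \eta(z_1, z_2) \int_{-\infty}^{\infty} \frac{1}{\tilde{\sigma}}\, \phi\!\left(\frac{y_1 - \tilde{\mu}_1}{\tilde{\sigma}}\right) \Phi\!\left(\frac{y_1 - \tilde{\mu}_2}{\tilde{\sigma}}\right) dy_1\, dz_2\, dz_1 \\
&= \int_{c}^{\infty} \int_{-\infty}^{c} \eta(z_1, z_2)\, E\,\Phi\!\left(V + \frac{\tilde{\mu}_1 - \tilde{\mu}_2}{\tilde{\sigma}}\right) dz_2\, dz_1 \\
&= \int_{c}^{\infty} \int_{-\infty}^{c} \eta(z_1, z_2)\, E\,\Phi\!\left(V + \frac{\rho(z_1 - z_2)}{\sigma_2 (1 - \rho^2)^{1/2}}\right) dz_2\, dz_1, \tag{8}
\end{aligned}$$

where $V$ is a standard normal random variable, $\phi$ is the standard normal
density and $\Phi$ is the standard normal cumulative distribution function. Note
that under the normality assumption, $\rho = 0$ implies that $Y$ and $Z$ are
independent, and it follows from (8) that $\mathrm{AUC}_I = 0.5$ in this case.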

Now, suppose that $X = (X_1, \ldots, X_p)^T$ is a $p$-dimensional random
vector of measures of a subject, and $Z$ is the continuous gold standard as
before. Suppose $l \in R^p$ and let $Y = l^T X$ be a linear combination of
$X$. Assume further that $X$ follows a multivariate normal distribution with
mean vector $\mu^*$ and covariance matrix $\Sigma$. Then $Y$ follows a normal
distribution with mean $\mu_1 = l^T \mu^*$ and variance
$\sigma_1^2 = l^T \Sigma l$. The correlation coefficient between $Y$ and $Z$ is
$\rho = l^T \mathrm{cov}(X, Z) / ((l^T \Sigma l)^{1/2} \sigma_2)$, where
$\mathrm{cov}(X, Z) = (\mathrm{cov}(X_1, Z), \ldots, \mathrm{cov}(X_p, Z))^T$.
Then, $\mathrm{AUC}_I$ for such a linear combination of the $X_j$'s,
$Y = l^T X$, is a function of $l$:

$$\mathrm{AUC}_I(l) = \int \mathrm{pr}(l^T X_1 > l^T X_2 \mid Z_1 > t, Z_2 \le t)\, f_c(t)\, dt, \tag{9}$$

where $(X_i^T, Z_i)^T$, $i = 1, 2$, are independent identically distributed
samples of $(X^T, Z)^T$. Our goal is to find the optimal linear combination of
$X_1, \ldots, X_p$ such that $\mathrm{AUC}_I$ is maximized, and it is known that
AUC is scale invariant. In order to make the solution identifiable, we search
for an $l_{opt}$ such that $\mathrm{AUC}_I(l_{opt}) \ge \mathrm{AUC}_I(l)$ for
all possible $l \in R^p$ with $\|l\| = 1$.
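From (8) and (9), using the identity $E\,\Phi(V + a) = \Phi(a/\sqrt{2})$ for a
standard normal $V$,

$$\mathrm{AUC}_I(l) = \int \frac{f_c(t)}{\mathrm{pr}(Z > t)\,\mathrm{pr}(Z \le t)} \int_{t}^{\infty} \int_{-\infty}^{t} \eta(z_1, z_2)\, \Phi\!\left(\frac{\rho(z_1 - z_2)}{\sigma_2 \{2(1 - \rho^2)\}^{1/2}}\right) dz_2\, dz_1\, dt, \tag{10}$$

which depends on $l$ only through $\rho$. Therefore,

$$\frac{\partial \mathrm{AUC}_I(l)}{\partial l} = \frac{\partial \rho}{\partial l} \int \frac{f_c(t)}{\mathrm{pr}(Z > t)\,\mathrm{pr}(Z \le t)} \int_{t}^{\infty} \int_{-\infty}^{t} \eta(z_1, z_2)\, \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\rho^2 (z_1 - z_2)^2}{4\sigma_2^2 (1 - \rho^2)}\right) \frac{z_1 - z_2}{\sqrt{2}\,\sigma_2 (1 - \rho^2)^{3/2}}\, dz_2\, dz_1\, dt = \frac{\partial \rho}{\partial l}\, A, \tag{11}$$

where $A$ denotes the integration part of (11). Since $z_1 > t \ge z_2$ on the
region of integration, $A > 0$, and the equation
$\partial \mathrm{AUC}_I(l)/\partial l = 0$ holds if and only if
$\partial \rho/\partial l = 0$; that is,

$$\frac{\partial}{\partial l}\, \frac{l^T \mathrm{cov}(X, Z)}{(l^T \Sigma l)^{1/2} \sigma_2} = 0.$$

It implies that the optimal linear combination coefficient is, up to a positive
scale factor,

$$l_{opt} = \Sigma^{-1} \mathrm{cov}(X, Z). \tag{12}$$

Note that, as in Su and Liu (1993), this optimal linear combination
coefficient $l_{opt}$ is independent of $c$, and depends only on the covariance
matrix of the variables and the covariance between the variables and the gold
standard.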

3.2. Estimation of the Optimal Linear Combination 

Assume that $\{(x_i, z_i), i = 1, \ldots, n\}$ is a set of $n$ independent and
identically distributed random samples, where $z_i$ denotes the observed gold
standard measures as before, and $x_i$ is its corresponding $p$-dimensional
vector of observed variable values of subject $i$. Without loss of generality, we
assume that all the components of $x$ and $z$ are centralized, since we can
always centralize the data by subtracting their sample means, and define
$H = (x_1 - \bar{x}, \ldots, x_n - \bar{x})^T$ as an $n \times p$ matrix, and
$\mathbf{z} = (z_1 - \bar{z}, \ldots, z_n - \bar{z})^T$ as a vector of length $n$,
where $\bar{x} = \sum_{i=1}^{n} x_i / n$ and $\bar{z} = \sum_{i=1}^{n} z_i / n$.
Hence, the estimate of $l_{opt}$ based on a sample of size $n$ following from
(12) is defined as

$$\hat{l} = (H^T H)^{-1} H^T \mathbf{z}. \tag{13}$$

Similarly to the linear regression problem, it is clear that $\hat{l}$ is a
strongly consistent estimate of $l_{opt}$ under some regularity conditions on $X$
and $Z$. Define

$$\hat{A}(c, l) = \frac{1}{n_1 n_0} \sum_{i \in S_1(c);\, j \in S_0(c)} \psi(l^T x_i - l^T x_j). \tag{14}$$

Then

$$\hat{A}_I(l) = \int \hat{A}(t, l) f_c(t)\, dt \tag{15}$$

is an estimate of $\mathrm{AUC}_I(l)$. It is easy to see that for given $t$,
$\hat{A}(t, l)$ converges to its population counterpart $A(t, l)$ uniformly with
respect to $l$. Hence, using the dominated convergence theorem, it is shown that
$\hat{A}_I(\hat{l})$ is a strongly consistent estimate of
$\mathrm{AUC}_I(l_{opt})$, and the details are omitted here. This result is
stated as a theorem below:

Theorem 3.1. Suppose that the joint distribution of $X \in R^p$, $Z \in R^1$
follows a multivariate normal distribution, where $Z$ is the continuous gold
standard, and $X$ denotes the $p$-dimensional vector of variables. Let
$\{(X_1, Z_1), \ldots, (X_n, Z_n)\}$ be independent and identically distributed
samples of size $n$. Then for a given density $f_c(t)$, with probability one,

$$\hat{A}_I(\hat{l}) - \mathrm{AUC}_I(l_{opt}) \to 0, \quad \text{as } n \to \infty,$$

where $\mathrm{AUC}_I(l_{opt})$ and $\hat{A}_I(\hat{l})$ are defined as in (9)
and (15) with $l = l_{opt}$ and $\hat{l}$, respectively.

Equation (13) provides a neat solution for the best linear combination
of variables under a joint multivariate normality assumption. However, it
can be seen from (13) that the calculation of $\hat{l}$ relies on the
computation of an inverse matrix. Thus, when the number of variables is large,
the direct calculation of $\hat{l}$ using (13) becomes numerically unstable. The
situation is worse when the sample size is relatively small compared to the
number of variables. So, we need an alternative numerical approach that can
handle problems with large $p$ to overcome this obstacle.
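
For small $p$, the estimate in (13) can be computed directly, as in the sketch below. It is written in plain Python for self-containment and solves the normal equations $(H^T H)\,l = H^T \mathbf{z}$ by Gaussian elimination; the function names `solve` and `estimate_l` and the toy data are hypothetical, and the code makes no attempt to handle a singular or ill-conditioned $H^T H$, which is precisely the large-$p$ difficulty described above.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def estimate_l(xs, zs):
    """Equation (13): (H^T H)^{-1} H^T z after centring the columns of x and z."""
    n, p = len(xs), len(xs[0])
    xbar = [sum(row[j] for row in xs) / n for j in range(p)]
    zbar = sum(zs) / n
    h = [[row[j] - xbar[j] for j in range(p)] for row in xs]
    zc = [zi - zbar for zi in zs]
    hth = [[sum(h[i][a] * h[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    htz = [sum(h[i][a] * zc[i] for i in range(n)) for a in range(p)]
    return solve(hth, htz)


# Toy data (hypothetical): two variables per subject and a continuous gold standard.
xs = [[0.1, 1.0], [0.4, 0.2], [0.9, 1.5], [1.3, 0.7], [1.8, 2.1], [2.2, 1.1]]
zs = [1.0, 0.7, 2.3, 2.1, 3.8, 3.4]
print(estimate_l(xs, zs))
```

Because only the direction of the coefficient vector matters for the AUC-type measure, the result can be rescaled to unit norm before it is evaluated.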

Again, from (13), we find that the estimate $\hat{l}$ can be viewed as a least
squares estimate of $l$ in the linear regression model below:

$$\mathbf{z} = H l + \epsilon, \tag{16}$$

where $\epsilon$ is an $n$-dimensional vector of random errors. When $p$ is
small, the solution can be obtained easily as in regression problems. When $p$ is
large, we can apply the least angle regression shrinkage (LARS) method
(Efron et al., 2004) to (16) to obtain an estimate of $l$. Since this is the same
as applying LARS in a regression setup, the properties of LARS are therefore
inherited. With the assistance of LARS, the proposed measure can be applied
to evaluate linear combinations of a large number of variables. The variable
selection scheme will follow from LARS as it is used in regression models.
However, when the normality assumption is violated or the normal approximation
to the joint distribution is not adequate, the empirical results show that the
$l_{opt}$ defined in (12) is not a good solution. Thus, an alternative
algorithm, which does not rely on the normality assumption, is required and
developed below.

Remark 3.2. Since the properties of applying LARS to find the linear
combination of variables are the same as those in linear regression, we omit the
details of applying LARS under the normality assumption. Instead, we focus
on the case without a normality assumption.

3.3. When the Joint Distribution is Unknown

As before, let's start with a one-dimensional case, and the case with a
linear combination of variables will follow easily as an extension.

Similarly to the methods used in Ma and Huang (2005) and Wang et al.
(2007a), we first use a sigmoid function $S(t) = 1/(1 + \exp(-t))$ to
approximate $\psi(\cdot)$ in equation (4). Thus, a smooth estimate of
$\mathrm{AUC}_I$ is defined as

$$\hat{A}_{Is} = \int \frac{1}{n_1 n_0} \sum_{i \in S_1(t);\, j \in S_0(t)} S\!\left(\frac{y_i - y_j}{h}\right) f_c(t)\, dt. \tag{17}$$

It follows from the results in the density estimation literature that for a
sufficiently small window width $h$, $S((y - x)/h) \approx \psi(y - x)$, which
implies the following asymptotic properties of $\hat{A}_{Is}$.

Theorem 3.3. Assume that $\{(y_1, z_1), \ldots, (y_n, z_n)\}$ are $n$ independent
and identically distributed samples of $(Y \in R^1, Z \in R^1)$, where $Z$
denotes a continuous gold standard. Denote the marginal densities of $Y$ and $Z$
by $f_Y$ and $f_Z$, respectively. Let $F(z \mid y)$ be the conditional cumulative
distribution function of $Z$ given $Y = y$. Suppose that $f_Y$ and $f_Z$ are
larger than 0 and bounded. Assume both $f_Y(\cdot)$ and $F(z \mid \cdot)$ are
uniformly continuous. Then for a given probability density $f_c(t)$ with
$h = O(n^{-\alpha})$, $1/5 < \alpha < 1/2$,

$$\hat{A}_{Is} - \mathrm{AUC}_I \to 0 \quad \text{almost surely as } n \to \infty,$$

where $\mathrm{AUC}_I$ and $\hat{A}_{Is}$ are defined in (2) and (17),
respectively.

(The proof of Theorem 3.3 relies on some classical results of density
approximation theory. The details are given in Appendix A.)
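
The smoothed estimate (17) can be illustrated with the following sketch. The function names are hypothetical, and the weight $f_c$ is taken to be the empirical distribution of the observed $z$-values, which is one convenient choice made here for the example rather than something fixed by the text.

```python
import math


def sigmoid(t):
    """Numerically stable S(t) = 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def smoothed_auc(y, z, c, h):
    """Smoothed version of equation (4): psi(u) is replaced by S(u / h)."""
    cases = [yi for yi, zi in zip(y, z) if zi > c]
    controls = [yi for yi, zi in zip(y, z) if zi <= c]
    if not cases or not controls:
        return 0.5  # same convention as for the unsmoothed estimate
    total = sum(sigmoid((yi - yj) / h) for yi in cases for yj in controls)
    return total / (len(cases) * len(controls))


def smoothed_auc_i(y, z, h):
    """Equation (17) with empirical weights on the cut points (an assumption)."""
    return sum(smoothed_auc(y, z, t, h) for t in z) / len(z)


# Toy data (hypothetical).
y = [0.2, 1.1, 0.9, 2.3, 1.8]
z = [0.1, 0.8, 1.2, 2.0, 2.5]
for h in (1.0, 0.1, 0.001):
    print(h, smoothed_auc(y, z, 1.0, h), smoothed_auc_i(y, z, h))
```

As $h$ shrinks, the smoothed value at a fixed cutting point approaches the unsmoothed estimate, while a large $h$ flattens every pairwise comparison towards 0.5.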

As before, we replace $y$ in (17) with $l^T x$, then we have the smooth estimate
of $\mathrm{AUC}_I(l)$ below:

$$\hat{A}_{Is}(l) = \int \frac{1}{n_1 n_0} \sum_{i \in S_1(t);\, j \in S_0(t)} S\!\left(\frac{l^T x_i - l^T x_j}{h}\right) f_c(t)\, dt. \tag{18}$$

The asymptotic property of $\hat{A}_{Is}(l)$ follows easily from Theorem 3.3 and
is summarized as the following theorem without proof.

Theorem 3.4. Suppose that $\{(x_1, z_1), \ldots, (x_n, z_n)\}$ are $n$
independent and identically distributed samples of $(X \in R^p, Z \in R^1)$,
where $Z$ denotes the continuous gold standard, and $X$ is a vector of
corresponding variables. Let $f_c(t)$ be a probability density. Assume that for a
given constant vector $l \in R^p$, the conditions of Theorem 3.3 hold for
$Y = l^T X$ and $Z$. Then for $h = O(n^{-\alpha})$ with $1/5 < \alpha < 1/2$,

$$\hat{A}_{Is}(l) - \mathrm{AUC}_I(l) \to 0 \quad \text{almost surely as } n \to \infty,$$

where $\mathrm{AUC}_I(l)$ and $\hat{A}_{Is}(l)$ are defined in (9) and (18),
respectively.

Remark 3.5. We only need to estimate the density function of the linear
combination $l^T X \in R^1$, hence the choice of $h$ does not depend on the
number of variables $p$. Thus, the density estimation part of the proposed
algorithm will not suffer from the curse of dimensionality.

Following Theorem 3.3, we apply the threshold gradient descent method
(TGDM) of Friedman and Popescu (2004) to find the best linear combination
$\hat{l}$ which maximizes $\hat{A}_{Is}(l)$. That is, to find a solution

$$\hat{l} = \arg\max_{l} \hat{A}_{Is}(l). \tag{19}$$

From equation (18), we know that $\hat{A}_{Is}$ is also scale invariant, as is
AUC. That is, $\hat{A}_{Is}(l)$ with window width $h$ equals
$\hat{A}_{Is}(kl)$ with window width $h' = kh$ for a positive constant $k$.
Hence, an anchor variable is needed such that the solution of (19) is unique.

TGDM Based Algorithm. Let $\{(x_1, z_1), \ldots, (x_n, z_n)\}$ be a set of random
samples of size $n$, which satisfies the assumption of Theorem 3.4. Define
$s = (s_1, \ldots, s_p)^T$ as a $p$-dimensional vector with $s_i = 1$ if the
corresponding empirical $\mathrm{AUC}_I$ of the $i$th variable is greater than
0.5; otherwise set $s_i = -1$. Let $\beta_i$ be a $p$-dimensional vector where
only the $i$th component equals $s_i$ and all others are 0. Define
$R_i = \hat{A}_{Is}(\beta_i)$, then choose the variable with the maximum $R_i$
value as the anchor variable. In the following algorithm, we assume
that $R_1 \ge R_i$, for $i = 2, \ldots, p$, without loss of generality. Let
$l_i$ denote the $i$th component of $l$, then $l_1$ is the coefficient of the
anchor variable. In order to make the coefficients identifiable, we set
$|l_1| = 1$. Following the notation defined above, a TGDM-based algorithm for
finding the best linear combination of variables that maximizes $\hat{A}_{Is}$
is stated below:

Algorithm:

(0) Initial stage: Let $r = 0$ and choose a threshold parameter $\tau$. Set
$l^{(0)} = (s_1, 0, \ldots, 0)^T$.

(1) Given $l = l^{(r)}$, calculate the derivative of the smoothed estimate
$\hat{A}_{Is}(l)$ with respect to the linear coefficient $l$,
$d(l^{(r)}) = (d_1(l^{(r)}), \ldots, d_p(l^{(r)}))^T = \partial \hat{A}_{Is}(l)/\partial l \,|_{l = l^{(r)}}$.

(2) Use the threshold gradient descent method to calculate
$l = l_{\delta}^{(r+1)}$, that is,
$l_{\delta}^{(r+1)} = l^{(r)} + \delta\, t(\tau, l^{(r)})\, d(l^{(r)})$ for some
$\delta > 0$, where the product is taken componentwise and $t(\tau, l^{(r)})$ is
an indicator vector

$$t(\tau, l^{(r)}) = I\{ d(l^{(r)}) \ge \tau \max\{d_1(l^{(r)}), \ldots, d_p(l^{(r)})\} \}.$$

(3) Find the optimal
$\delta^* = \arg\max_{\delta > 0} \hat{A}_{Is}(l_{\delta}^{(r+1)})$ with
$l_{\delta}^{(r+1)} = l^{(r)} + \delta\, t(\tau, l^{(r)})\, d(l^{(r)})$ and update
$l^{(r+1)} = l^{(r)} + \delta^*\, t(\tau, l^{(r)})\, d(l^{(r)})$.

(4) Set $r = r + 1$ and repeat steps (1)-(3) until $\hat{A}_{Is}(l^{(r+1)})$
converges.

Remark 3.6. The initial value of $l$ is chosen as $(s_1, 0, \ldots, 0)^T$, since
the first component of $l$ corresponds to the selected anchor variable. In Step
(2), we update $l^{(r)}$ along the direction $t(\tau, l^{(r)})\, d(l^{(r)})$,
where the number of nonzero components is decided by the threshold parameter
$\tau$, and by the definition of $t(\tau, l^{(r)})$, the locations of nonzero
components of $t(\tau, l^{(r)})$ are determined by the elements of the gradient
$d(l^{(r)})$. Step (3) is to find a suitable step size $\delta^*$ along the
direction of Step (2), then update the linear coefficients of the variables. The
criterion of convergence of Step (4) has to be predetermined.
(The software used in this paper (GoldAUC) is available at 
http://idv.sinica.edu.tw/ycchang/software.html). 
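
The steps of the TGDM-based algorithm can be sketched as below. This is not the GoldAUC software and should be read as an illustration under several assumptions made here: the first column of `x` is already the anchor variable with $s_1 = +1$ and its coefficient is held fixed; $f_c$ is the empirical distribution of the observed $z$-values; the threshold compares absolute gradient components, as in Friedman and Popescu's formulation; and the line search in Step (3) is a crude search over a fixed grid of step sizes. All function names and default values are hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable S(t) = 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def pair_weights(z):
    """Total weight of each (case i, control j) pair over cut points t in z."""
    n = len(z)
    w = {}
    for t in z:
        cases = [i for i in range(n) if z[i] > t]
        controls = [j for j in range(n) if z[j] <= t]
        if not cases or not controls:
            continue  # such cut points only add a constant to the objective
        share = 1.0 / (n * len(cases) * len(controls))
        for i in cases:
            for j in controls:
                w[(i, j)] = w.get((i, j), 0.0) + share
    return w


def objective(l, x, w, h):
    """Smoothed estimate (18), up to the additive constant from empty groups."""
    score = [sum(a * b for a, b in zip(l, row)) for row in x]
    return sum(wij * sigmoid((score[i] - score[j]) / h)
               for (i, j), wij in w.items())


def gradient(l, x, w, h):
    """Derivative of the objective with respect to l, using S' = S (1 - S)."""
    score = [sum(a * b for a, b in zip(l, row)) for row in x]
    g = [0.0] * len(l)
    for (i, j), wij in w.items():
        s = sigmoid((score[i] - score[j]) / h)
        f = wij * s * (1.0 - s) / h
        for k in range(len(l)):
            g[k] += f * (x[i][k] - x[j][k])
    return g


def tgdm(x, z, h=0.5, tau=0.5, steps=(0.01, 0.1, 1.0), max_iter=50, tol=1e-8):
    """Threshold gradient ascent with the first column as the anchor variable."""
    p = len(x[0])
    w = pair_weights(z)
    l = [1.0] + [0.0] * (p - 1)        # step (0), assuming s_1 = +1
    best = objective(l, x, w, h)
    for _ in range(max_iter):
        d = gradient(l, x, w, h)       # step (1)
        d[0] = 0.0                     # anchor coefficient stays fixed (assumption)
        dmax = max(abs(v) for v in d)
        if dmax == 0.0:
            break
        # step (2): keep only components whose gradient is large enough
        direction = [v if abs(v) >= tau * dmax else 0.0 for v in d]
        # step (3): crude line search over a fixed grid of step sizes
        cand = [[a + s * b for a, b in zip(l, direction)] for s in steps]
        vals = [objective(c, x, w, h) for c in cand]
        k = max(range(len(steps)), key=lambda i: vals[i])
        if vals[k] - best <= tol:      # step (4): stop when nothing improves
            break
        l, best = cand[k], vals[k]
    return l, best


# Toy data (hypothetical): the gold standard depends on both variables.
x = [[0.1, 1.0], [0.4, 0.2], [0.9, 1.5], [1.3, 0.7],
     [1.8, 2.1], [2.2, 1.1], [2.9, 2.6], [3.1, 1.9]]
z = [row[0] + row[1] for row in x]
print(tgdm(x, z))
```

Setting `tau` close to 1 updates only the coefficient with the largest gradient at each step, which tends to give sparser solutions, whereas `tau = 0` moves along the full gradient.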



4. Numerical studies 

In numerical studies, we calculate the proposed measures $\hat{A}_{Ij}$,
$j = 1, 2, 3$, corresponding to the three different $f_c(t)$ defined before.
Since the correlation coefficient is a basic statistic to measure the
association between two continuous variables, we include it in our experimental
studies. We also compare the performances of our methods with that of
Obuchowski's (2006) method (page 485, Equation (9)) described below:

$$\hat{\theta} = \frac{1}{n(n - 1)} \sum_{i=1}^{n} \sum_{j=1}^{n} \psi(y_i, z_i, y_j, z_j),$$

where $i \ne j$,

$\psi(y_i, z_i, y_j, z_j)$ = 1 if $y_i > y_j$ and $z_i > z_j$, or $y_i < y_j$ and $z_i < z_j$;
= 0.5 if $y_i = y_j$ or $z_i = z_j$;
= 0 otherwise.
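
For comparison, the pairwise concordance statistic described above can be sketched as follows. The normalisation by $n(n - 1)$ over ordered pairs with $i \ne j$ is an assumption made for this illustration, and the function names are hypothetical.

```python
def psi_pair(yi, zi, yj, zj):
    """Concordance kernel: 1 if concordant, 0.5 if tied in y or z, 0 otherwise."""
    if yi == yj or zi == zj:
        return 0.5
    if (yi > yj) == (zi > zj):
        return 1.0
    return 0.0


def theta_hat(y, z):
    """Average of the kernel over all ordered pairs (i, j) with i != j."""
    n = len(y)
    total = sum(psi_pair(y[i], z[i], y[j], z[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


# Toy data (hypothetical).
y = [0.2, 1.1, 0.9, 2.3, 1.8]
z = [0.1, 0.8, 1.2, 2.0, 2.5]
print(theta_hat(y, z))
```

Unlike the estimates based on cutting points, this statistic compares every pair of subjects directly, so it needs no weight function.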

The sample sizes used in our numerical studies are n = 50 and 100. 
The window width for the kernel estimate in Aj^ is equal to n^^^. The 
bootstrap sample size for estimating the variance of each case is 200, and 
there are 100 replicates for each simulation setup. For the first experimental 
study, the data are generated from bivariate normal distributions with means
$\mu_1 = \mu_2 = 1.0$, standard deviations $(\sigma_1, \sigma_2)$ equal to
(1.0, 1.0), (1.0, 2.0), (2.0, 1.0) and (2.0, 2.0), and correlation coefficients
equal to $\rho$ = 0.0, 0.25, 0.5, 0.75, and 1.0. Let $\hat{\mu}$ and
$\hat{\sigma}^2$ denote the sample mean and variance of $z$.
As in the classical ROC curve analysis, when a variable has no diagnostic
power, its corresponding ROC curve will be the 45 degree diagonal
line of the unit square. If this case holds for all possible cutting points,
then it implies that $\mathrm{AUC}_I = 0.5$. So, we use 0.5 as the value of the
null hypothesis in our numerical study. Table 1 shows five statistics for
different simulation setups: the correlation coefficient of two variables
$\hat{\rho}$, $\hat{A}_{Ij}$, $j = 1, 2, 3$, with corresponding $f_c(t)$'s, and
$\hat{\theta}$ from Obuchowski (2006). Figure 1 is a plot of statistics
$\hat{\rho}^2/\hat{V}(\hat{\rho})$,
$(\hat{A}_{Ij} - 0.5)^2/\hat{V}(\hat{A}_{Ij})$ for all $j$'s, and
$(\hat{\theta} - 0.5)^2/\hat{V}(\hat{\theta})$ versus $\rho$, where
$\hat{V}(\hat{\rho})$ and $\hat{V}(\hat{\theta})$ are the bootstrap estimates of
the variances of $\hat{\rho}$ and $\hat{\theta}$, respectively.

When the joint distribution of two variables follows a bivariate normal
distribution, the correlation coefficient is a natural statistic to describe the
association between the two variables. In our study, all five measures increase
as the true correlation coefficient $\rho$ increases, which suggests that all
measures catch the linear association between variable $Y$ and the gold standard
$Z$ as expected. In fact, $\hat{A}_{Ij}$ and $\hat{\theta}$ are very close to
their true values 0.5 and 1.0 when $\rho$ is equal to 0.0 and 1.0, respectively.
In addition, Figure 1 shows that the values of
$\hat{\rho}^2/\hat{V}(\hat{\rho})$ and
$(\hat{A}_{Ij} - 0.5)^2/\hat{V}(\hat{A}_{Ij})$, $j = 1, 2, 3$, are larger
than those of $(\hat{\theta} - 0.5)^2/\hat{V}(\hat{\theta})$ under the current
simulation setup.

Table 2 shows the results of five measures when there is no association
between variable $Y$ and the gold standard $Z$. That is, the data sets used in
this table are generated from the model $y = z^2 + e$ with standard normal error
$e$, where the gold standard $z$ is generated from three different
distributions: (1) a normal distribution, (2) a $\chi^2$ distribution with 2
degrees of freedom, and (3) a Cauchy
Table 1: Comparison of five measure indexes: $\hat{\rho}$, $\hat{A}_{Ij}$, $j = 1, 2, 3$, and $\hat{\theta}$, where the marker
and gold standard, $(y, z)$, follow a multivariate normal distribution with means $\mu_1 = \mu_2 = 1.0$,
with different standard deviations $\sigma_1$, $\sigma_2$ and distinct correlation coefficients $\rho$.

n    ($\sigma_1$, $\sigma_2$)   Method            $\rho$ = 0.0            $\rho$ = 0.25           $\rho$ = 0.5            $\rho$ = 0.75           $\rho$ = 1.0
50   (1.0, 1.0)   $\hat{\rho}$      0.105 (0.076, 0.140)*   0.252 (0.118, 0.130)    0.511 (0.088, 0.103)    0.747 (0.067, 0.064)    1.000 (0.000, 0.000)
                  $\hat{A}_{I1}$    0.505 (0.064, 0.073)    0.621 (0.063, 0.069)    0.746 (0.053, 0.058)    0.866 (0.040, 0.038)    1.000 (0.000, 0.000)
                  $\hat{A}_{I2}$    0.501 (0.065, 0.067)    0.616 (0.062, 0.063)    0.743 (0.046, 0.052)    0.856 (0.035, 0.033)    0.979 (0.010, 0.013)
                  $\hat{A}_{I3}$    0.498 (0.065, 0.066)    0.611 (0.062, 0.062)    0.737 (0.045, 0.051)    0.846 (0.037, 0.034)    0.968 (0.011, 0.015)
                  $\hat{\theta}$    0.504 (0.044, 0.049)    0.583 (0.044, 0.048)    0.673 (0.038, 0.044)    0.771 (0.036, 0.035)    1.000 (0.000, 0.004)
     (1.0, 2.0)   $\hat{\rho}$      0.106 (0.073, 0.136)    0.263 (0.118, 0.131)    0.477 (0.099, 0.109)    0.750 (0.058, 0.065)    1.000 (0.000, 0.000)
                  $\hat{A}_{I1}$    0.497 (0.067, 0.073)    0.621 (0.061, 0.070)    0.730 (0.053, 0.061)    0.862 (0.034, 0.040)    1.000 (0.000, 0.000)
                  $\hat{A}_{I2}$    0.495 (0.065, 0.066)    0.622 (0.061, 0.064)    0.729 (0.053, 0.054)    0.859 (0.029, 0.033)    0.980 (0.008, 0.010)
                  $\hat{A}_{I3}$    0.496 (0.065, 0.066)    0.622 (0.062, 0.064)    0.729 (0.051, 0.054)    0.858 (0.030, 0.034)    0.983 (0.004, 0.009)
                  $\hat{\theta}$    0.498 (0.044, 0.049)    0.583 (0.043, 0.049)    0.660 (0.038, 0.044)    0.769 (0.032, 0.036)    1.000 (0.000, 0.004)
100  (1.0, 1.0)   $\hat{\rho}$      0.085 (0.056, 0.098)    0.253 (0.083, 0.092)    0.497 (0.082, 0.075)    0.747 (0.046, 0.044)    1.000 (0.000, 0.000)
                  $\hat{A}_{I1}$    0.490 (0.050, 0.051)    0.620 (0.043, 0.048)    0.739 (0.046, 0.041)    0.865 (0.024, 0.027)    1.000 (0.000, 0.000)
                  $\hat{A}_{I2}$    0.485 (0.053, 0.049)    0.622 (0.041, 0.046)    0.741 (0.042, 0.038)    0.864 (0.023, 0.023)    0.987 (0.007, 0.008)
                  $\hat{A}_{I3}$    0.483 (0.054, 0.049)    0.620 (0.042, 0.045)    0.739 (0.042, 0.037)    0.861 (0.024, 0.023)    0.982 (0.006, 0.009)
                  $\hat{\theta}$    0.493 (0.033, 0.034)    0.581 (0.029, 0.033)    0.668 (0.033, 0.030)    0.771 (0.023, 0.024)    1.000 (0.000, 0.001)
     (1.0, 2.0)   $\hat{\rho}$      0.075 (0.057, 0.097)    0.266 (0.100, 0.091)    0.499 (0.081, 0.074)    0.739 (0.045, 0.046)    1.000 (0.000, 0.000)
                  $\hat{A}_{I1}$    0.496 (0.049, 0.051)    0.625 (0.053, 0.048)    0.739 (0.042, 0.041)    0.859 (0.025, 0.027)    1.000 (0.000, 0.000)
                  $\hat{A}_{I2}$    0.493 (0.050, 0.049)    0.629 (0.051, 0.045)    0.744 (0.041, 0.037)    0.862 (0.024, 0.023)    0.987 (0.006, 0.006)
                  $\hat{A}_{I3}$    0.494 (0.049, 0.049)    0.630 (0.052, 0.045)    0.745 (0.041, 0.037)    0.862 (0.024, 0.024)    0.99 (0.003, 0.005)
                  $\hat{\theta}$    0.498 (0.032, 0.034)    0.586 (0.036, 0.033)    0.667 (0.031, 0.030)    0.765 (0.022, 0.024)    1.000 (0.000, 0.001)

* Empirical standard deviations and mean values of bootstrap standard deviations are in parentheses.
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{a„a,M2.0,^.0) 



{c„c,)=(2.0,2.0) 




p (n=50) 




{oi,O2)=(2.0,1.0) 



(ai,a2)=(2.0,2.0) 




0.2 0.4 0.6 

p (n=100) 




0.2 0.4 0.6 

p {n=100) 



Figure 1: Comparison of five measures: p'^/V{p), {Ajj — 0.5)"^ /V{Aij), j = 1, 2, 3, and 
{6 — 0.5}^ /V{9), where {Y, Z) follow bivariate normal distributions with means p,i — fJ,2 = 
1.0, with different standard deviations ci, cr2 and correlation coefficients p. 
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distribution. Since z has symmetrical density functions for all three cases, 
it is clear that there is no association between $Y$ and $Z$. That is, the ideal values of the correlation coefficient estimate and the ROC-type index estimates $|A_{Ij} - 0.5|$, $j = 1, 2, 3$, and $|\hat\theta - 0.5|$ should be close to 0. We calculate the 25%, 50% and 75% empirical quantiles based on 100 simulations. The p-values, with a nominal significance level equal to 0.05, for the statistics $\hat\rho^2/V(\hat\rho)$, $(A_{Ij} - 0.5)^2/V(A_{Ij})$, $j = 1, 2, 3$, and $(\hat\theta - 0.5)^2/V(\hat\theta)$ are also reported. It is seen from Table 2 that all three quantiles of $A_{I3}$ and $\hat\theta$ are very close to 0, while the correlation coefficient seems to over-estimate the association of $Y$ and $Z$ in this experiment. When the tail of the distribution of $Z$ becomes heavier, the quantiles and p-values of $\hat\rho$ move further from 0.0 and the nominal 0.05, respectively. In particular, when $Z$ is from a Cauchy distribution, the 25% quantiles are larger than 0.5 and the corresponding p-values are greater than 0.3.

The performances of $A_{I3}$ and $\hat\theta$ are better than those of $A_{I1}$ and $A_{I2}$ when $Z$ is not from a normal distribution. Because $A_{I3}$ is based on a kernel estimate of $f_c(t)$ and $\hat\theta$ is founded on a nonparametric method, they are not affected by the distribution of $Z$ and therefore remain very stable even when $Z$ is not normally distributed.
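The no-association experiment described above can be illustrated with a short simulation. The following Python sketch is only an illustration under stated assumptions: it takes the correlation coefficient to be the Pearson correlation, draws $z$ from NumPy's standard normal, $t_2$ and Cauchy generators, and uses hypothetical names (`abs_corr_quantiles`, `seed`) that do not come from the paper. It reports the empirical quartiles of $|\hat\rho|$ only, not the ROC-type indexes.

```python
import numpy as np

def abs_corr_quantiles(dist, n=50, reps=100, seed=0):
    """Quartiles of |Pearson correlation| between y = z**2 + e and z.

    dist selects the distribution of z: "normal", "t2" or "cauchy".
    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    draw = {
        "normal": lambda: rng.standard_normal(n),
        "t2": lambda: rng.standard_t(2, n),
        "cauchy": lambda: rng.standard_cauchy(n),
    }[dist]
    values = []
    for _ in range(reps):
        z = draw()
        y = z**2 + rng.standard_normal(n)  # model y = z^2 + e
        values.append(abs(np.corrcoef(y, z)[0, 1]))
    # 25%, 50% and 75% empirical quantiles over the replications
    return np.quantile(values, [0.25, 0.5, 0.75])

if __name__ == "__main__":
    for name in ("normal", "t2", "cauchy"):
        print(name, np.round(abs_corr_quantiles(name), 3))
```

With heavy-tailed $z$, a single extreme observation tends to dominate both $z$ and $z^2$, which is one plausible reason the sample correlation drifts away from 0 in this setting.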

In summary, based on the results of Figure 1 and Tables 1 and 2, both $A_{I3}$ and $\hat\theta$ are recommended for detecting the association between variables and the continuous gold standard. Although $\hat\theta$ is considered a natural extension of the ordinary AUC index, it is worth noting that the performance of $A_{Ij}$ (especially $A_{I3}$) in these cases is very competitive.

4.1. Combination of Variables

Both the correlation coefficient (CC) and the TGDM algorithm are used to obtain the optimal linear combinations of variables. We then calculate $A_{I3}$ and $\hat\theta$ of the corresponding combination of variables based on the coefficient vectors obtained from these two methods. The threshold parameter $\tau$ in the TGDM algorithm is equal to 1.0 in our studies. The data sets are generated from $Z = l^\top X + \epsilon$, where $X$ follows a $p$-dimensional multivariate normal distribution with mean vector $(0, \ldots, 0)^\top$ and an identity covariance matrix, and the true $l = (1.0, 1.0, 0.0, \ldots, 0.0)^\top$. The error term $\epsilon$ is generated from either the standard normal distribution or a Cauchy distribution. In this experimental study, we have tried three different dimensions of $X$ ($p = 4, 10, 20$) for all cases, and only variables $X_1$ and $X_2$ have non-zero coefficients. That is, only these two variables are associated with the gold standard. Moreover,

Table 2: Comparison of different methods when there is no association between variable $Y$ and the gold
standard $Z$. The data set $(y, z)$ is generated from the model $y = z^2 + \epsilon$ with standard normal error $\epsilon$. Three different distributions of $z$ are used: a normal distribution, a $t$ distribution with 2 degrees of freedom ($t_2$) and a Cauchy distribution.

                      Normal                            $t_2$                             Cauchy
n     Model           (25%, 50%, 75%)         p-value*  (25%, 50%, 75%)         p-value   (25%, 50%, 75%)         p-value
50    $\hat\rho$      (0.086, 0.175, 0.287)   0.13      (0.277, 0.544, 0.802)   0.30      (0.515, 0.834, 0.948)   0.33
      $A_{I1}$        (0.031, 0.060, 0.110)   0.09      (0.048, 0.079, 0.125)   0.12      (0.044, 0.074, 0.125)   0.13
      $A_{I2}$        (0.025, 0.054, 0.090)   0.06      (0.041, 0.076, 0.120)   0.13      (0.040, 0.078, 0.150)   0.15
      $A_{I3}$        (0.022, 0.050, 0.087)   0.06      (0.036, 0.060, 0.090)   0.09      (0.028, 0.055, 0.088)   0.09
      $\hat\theta$    (0.021, 0.043, 0.081)   0.08      (0.034, 0.060, 0.087)   0.09      (0.025, 0.049, 0.090)   0.08
100   $\hat\rho$      (0.055, 0.114, 0.202)   0.07      (0.305, 0.527, 0.728)   0.27      (0.603, 0.825, 0.931)   0.37
      $A_{I1}$        (0.022, 0.044, 0.072)   0.07      (0.016, 0.035, 0.072)   0.06      (0.029, 0.052, 0.098)   0.15
      $A_{I2}$        (0.020, 0.033, 0.059)   0.06      (0.014, 0.033, 0.064)   0.05      (0.026, 0.048, 0.096)   0.14
      $A_{I3}$        (0.017, 0.034, 0.060)   0.06      (0.009, 0.032, 0.056)   0.04      (0.018, 0.041, 0.07)    0.08
      $\hat\theta$    (0.015, 0.032, 0.049)   0.07      (0.012, 0.031, 0.054)   0.04      (0.014, 0.036, 0.065)   0.06

*Nominal significance level is 0.05.



an R package implementing the TGDM algorithm for calculating the optimal linear combination of variables is available. It is also worth noting that there is no algorithm or discussion in Obuchowski (2005) about finding the linear combination of variables based on $\hat\theta$.
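The simulation design for the combination study can be sketched as follows. This Python sketch is only an illustration of the data-generating model $Z = l^\top X + \epsilon$ with two non-zero coefficients; the function name `simulate_combination_data` and its arguments are hypothetical, and the sketch does not implement the CC or TGDM fitting steps.

```python
import numpy as np

def simulate_combination_data(n, p, error="normal", seed=0):
    """Draw (X, Z, l) from Z = l'X + eps with l = (1, 1, 0, ..., 0).

    error is "normal" (standard normal eps) or "cauchy" (standard Cauchy eps).
    The function name and signature are illustrative assumptions.
    """
    if p < 2:
        raise ValueError("p must be at least 2")
    if error not in ("normal", "cauchy"):
        raise ValueError("error must be 'normal' or 'cauchy'")
    rng = np.random.default_rng(seed)
    l = np.zeros(p)
    l[:2] = 1.0  # only X_1 and X_2 are associated with Z
    X = rng.standard_normal((n, p))  # mean 0, identity covariance
    if error == "normal":
        eps = rng.standard_normal(n)
    else:
        eps = rng.standard_cauchy(n)
    Z = X @ l + eps
    return X, Z, l

if __name__ == "__main__":
    for p in (4, 10, 20):
        X, Z, l = simulate_combination_data(n=50, p=p, error="cauchy")
        print(p, X.shape, Z.shape, l[:4])
```

Fixing the seed makes each configuration reproducible, which is convenient when comparing methods on identical simulated data sets.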

Table 3 lists the values of $A_{I3}$ and $\hat\theta$ for the individual variables $X_1$ and $X_2$, and for the linear combinations based on the CC and TGDM methods. From this table, we find that $A_{I3}$ and $\hat\theta$ for linear combinations of variables are always larger than for individual variables, which confirms that linear combinations of variables can improve on the diagnostic power of individual variables. When $\epsilon$ follows the standard normal distribution, $A_{I3}$ and $\hat\theta$ for linear combinations based on both TGDM and CC are very close. However, when $\epsilon$ follows a Cauchy distribution, the TGDM method has larger $A_{I3}$ and $\hat\theta$ than combinations based on CC. This is because the CC method relies on the normality assumption, while TGDM does not. In addition, from Table 3 we can see that $A_{I3}$ is larger than $\hat\theta$. In most of the cases, the standard deviations of TGDM are smaller than those of $\hat\theta$, which suggests that the linear combinations based on TGDM have greater diagnostic power, although the difference may not be statistically significant in our simulation.

Table 3: Results of linear combination using correlation coefficient (CC) and TGDM
method.

                                            Nonzero coef.†
Distribution  p**   n     Method         $X_1$           $X_2$          CC             TGDM
Normal        4     50    $A_{I3}$       0.773(0.054)*   0.786(0.052)   0.900(0.024)   0.900(0.028)
                          $\hat\theta$   0.694(0.043)    0.702(0.042)   0.815(0.028)   0.815(0.031)
                    100   $A_{I3}$       0.782(0.033)    0.785(0.035)   0.904(0.018)   0.906(0.018)
                          $\hat\theta$   0.693(0.027)    0.696(0.030)   0.807(0.021)   0.809(0.021)
              10    50    $A_{I3}$       0.785(0.048)    0.773(0.046)   0.909(0.021)   0.900(0.031)
                          $\hat\theta$   0.703(0.037)    0.692(0.040)   0.824(0.027)   0.815(0.033)
                    100   $A_{I3}$       0.791(0.036)    0.789(0.032)   0.913(0.015)   0.913(0.016)
                          $\hat\theta$   0.699(0.030)    0.700(0.025)   0.818(0.019)   0.817(0.020)
              20    50    $A_{I3}$       0.767(0.051)    0.779(0.053)   0.928(0.018)   0.897(0.034)
                          $\hat\theta$   0.689(0.042)    0.698(0.042)   0.852(0.026)   0.813(0.039)
                    100   $A_{I3}$       0.782(0.033)    0.783(0.032)   0.922(0.015)   0.915(0.016)
                          $\hat\theta$   0.693(0.028)    0.696(0.025)   0.828(0.019)   0.820(0.019)
Cauchy        4     50    $A_{I3}$       0.659(0.067)    0.640(0.068)   0.669(0.107)   0.735(0.073)
                          $\hat\theta$   0.629(0.046)    0.614(0.046)   0.619(0.088)   0.685(0.059)
                    100   $A_{I3}$       0.660(0.056)    0.657(0.047)   0.659(0.094)   0.724(0.077)
                          $\hat\theta$   0.629(0.036)    0.625(0.032)   0.615(0.078)   0.679(0.063)
              10    50    $A_{I3}$       0.648(0.064)    0.645(0.072)   0.690(0.099)   0.750(0.067)
                          $\hat\theta$   0.620(0.045)    0.618(0.048)   0.628(0.079)   0.689(0.056)
                    100   $A_{I3}$       0.648(0.083)    0.638(0.082)   0.664(0.104)   0.733(0.101)
                          $\hat\theta$   0.625(0.033)    0.618(0.035)   0.614(0.063)   0.683(0.061)
              20    50    $A_{I3}$       0.647(0.093)    0.657(0.096)   0.740(0.123)   0.789(0.096)
                          $\hat\theta$   0.623(0.044)    0.628(0.046)   0.665(0.083)   0.719(0.052)
                    100   $A_{I3}$       0.634(0.123)    0.638(0.120)   0.649(0.142)   0.739(0.147)
                          $\hat\theta$   0.624(0.032)    0.627(0.029)   0.604(0.068)   0.689(0.069)

†Nonzero coef. represents variables with non-zero coefficients in the true model.
*Empirical standard deviations are in parentheses.
**$p$ denotes the total number of variables in the true model; the number of non-zero variables is $p_1 = 2$.

4.2. Real examples

We apply the proposed measures to three real data sets: the tumor, prostate and diabetes data sets, which are used in Obuchowski (2005), Stamey et al. (1989) and Willems et al. (1997), respectively. In the tumor data set, there are 74 patients and only two surgery variables: the computed tomography (CT) and a fictitious test (Fi). The continuous gold standard of this data set is the size of the renal tumor mass. The prostate data has 97 patients with prostate specific antigen as its gold standard together with 6 continuous variables, which are cancer volume, prostate weight, age (Age), benign prostatic hyperplasia amount, capsular penetration, and percentage Gleason scores 4 or 5 (Pgg45). Except for the variables Age and Pgg45, the others are recoded in log-scale and denoted by Lcavol, Lweight, Lbph, Lcp and Lpsa, accordingly. The original diabetes data consists of 403 subjects, but we follow Willems et al. (1997) and delete 22 subjects with missing variables. Of the remaining 381 subjects from this data set used in our numerical study, 222 are females and 159 are males. The following 8 continuous variables are used in this data set: total cholesterol (Chol), stabilized glucose (Stab.glu), high density lipoprotein (Hdl), cholesterol/HDL ratio (Ratio), age (Age), body mass index (BMI) and waist/hip ratio (WHR). The gold standard for this data set is glycosylated hemoglobin (Glyhb), which is commonly used as a measure of the progress of diabetes. In addition to analyzing the entire diabetes data set, we also investigate the female and male subgroups separately.

We normalize the data before applying the proposed measures to each data set to avoid scale variations. Table 4 presents $A_{I3}$ and $\hat\theta$ for individual variables with p-value less than 10~^. From Table 4 we find that $A_{I3}$ selects more variables than $\hat\theta$ in some cases. Note that the $A_{I3}$ values are much larger than the $\hat\theta$ values, with competitive standard deviations, in these cases.

Table 5 lists the linear coefficients obtained using the TGDM and CC methods, and their corresponding $A_{I3}$ and $\hat\theta$ values for all data sets, including the male and female subgroups of the diabetes data set. In the tumor data set, Fi has a larger $A_{I3}$ value than CT; that is, Fi has a greater association with the size of the renal tumor mass. In the prostate data set, Lcavol has the largest $A_{I3}$ value; that is, Lcavol is most highly associated with prostate specific antigen among all variables considered in the prostate data set. For the diabetes data set and its male and female subgroups, the variable with the largest $A_{I3}$ and the largest coefficient value is Stab.glu; that is, Stab.glu has the highest potential to diagnose diabetes in terms of the glycosylated hemoglobin index. As expected, from Tables 4 and 5 the linear

Table 4: Results of ROC measure indexes: $A_{I3}$ and $\hat\theta$, of single markers for
tumor, prostate, diabetes, diabetes-female and diabetes-male data sets.

Tumor
Data              Method         CT              Fi
Tumor             $A_{I3}$       0.943(0.014)*   0.982(0.011)
                  $\hat\theta$   0.871(0.020)    0.956(0.008)

Prostate
Data              Method         Lcavol          Lweight        Lcp            Pgg45
Prostate          $A_{I3}$       0.865(0.022)    0.722(0.034)   0.759(0.035)   0.744(0.035)
                  $\hat\theta$   0.758(0.027)    0.647(0.027)   0.675(0.031)   0.676(0.028)

Diabetes
Data              Method         Chol            Stab.glu       Ratio          Age
Diabetes          $A_{I3}$       -               0.779(0.021)   0.662(0.022)   0.711(0.019)
                  $\hat\theta$   -               0.687(0.017)   0.600(0.015)   0.644(0.014)
Diabetes-female   $A_{I3}$       0.667(0.029)    0.786(0.022)   -              0.732(0.025)
                  $\hat\theta$   -               0.691(0.021)   -              0.665(0.019)
Diabetes-male     $A_{I3}$       -               0.769(0.039)   0.689(0.034)   0.681(0.030)
                  $\hat\theta$   -               0.682(0.030)   -              -

*Bootstrap standard deviation is in parentheses.



combinations based on TGDM and CC usually have larger $A_{I3}$ and $\hat\theta$ values than individual variables do, and similarly, the $A_{I3}$ and $\hat\theta$ values for combinations from TGDM are slightly larger than those obtained using the CC method. In real data sets the relation is seldom linear, which is why the combinations obtained using TGDM perform better than the others.

5. Conclusion and Discussion 

In this paper, we first propose a new measure for evaluating the potential diagnostic power of individual variables, when there is only a continuous

Table 5: Results of optimal linear coefficients and corresponding ROC measure indexes: $A_{I3}$ and
$\hat\theta$, for tumor, prostate, diabetes, diabetes-female and diabetes-male data sets.

Tumor
Data              Method   Coef.                                                            ROC-type indexes*
                           CT       Fi                                                      $A_{I3}$       $\hat\theta$
Tumor             CC       -0.118   1.076                                                   0.981(0.011)   0.950(0.009)
                  TGDM      0.044   1.000                                                   0.983(0.011)   0.957(0.008)

Prostate
Data              Method   Coef.                                                            ROC-type indexes
                           Lcavol   Lweight    Age      Lbph     Lcp      Pgg45             $A_{I3}$       $\hat\theta$
Prostate          CC        0.642   0.214     -0.118    0.099    0.017    0.147             0.892(0.018)   0.791(0.024)
                  TGDM      1.000   0.264     -0.108    0.135   -0.013    0.189             0.891(0.017)   0.789(0.023)

Diabetes
Data              Method   Coef.                                                            ROC-type indexes
                           Chol     Stab.glu   Hdl      Ratio    Age      BMI      WHR      $A_{I3}$       $\hat\theta$
Diabetes          CC        0.074   0.668      0.018    0.101    0.101    0.017    0.019    0.816(0.017)   0.717(0.015)
                  TGDM      0.061   1.000     -0.027    0.099    0.373    0.140    0.011    0.826(0.018)   0.723(0.016)
Diabetes-female   CC        0.109   0.659     -0.073    0.027    0.106    0.029    0.069    0.834(0.021)   0.737(0.019)
                  TGDM      0.253   1.000     -0.164   -0.007    0.389    0.133    0.199    0.842(0.019)   0.741(0.018)
Diabetes-male     CC       -0.005   0.701      0.141    0.243    0.085   -0.049   -0.002    0.786(0.03)    0.691(0.025)
                  TGDM     -0.016   1.000      0.009    0.179    0.367    0.100   -0.040    0.811(0.031)   0.706(0.027)

*ROC-type indexes used here are $AUC_{I3}$ and $\hat\theta$.

gold standard available and no confirmative threshold for it is known. The
proposed measure is an AUC-type index that shares the threshold-independent property of the ROC curve and AUC, and can also be used to evaluate the performance of classifiers when the gold standard variable is essentially continuous and the threshold is controvertible. Numerical results show that the proposed index is very competitive with the existing method.

In addition, we propose algorithms, based on the newly defined index, for finding the best linear combination of variables. This is useful from a practical perspective when multiple variables are considered at a time and evaluating or selecting a good combination of variables is an important issue. We also study numerical methods for finding the linear combination of variables that maximizes the proposed measure. When the normality assumption of variables is valid, the best linear combination can be realized as a solution to a linear system. Thus, under an assumption of normality and when the number of variables $p$ is large, the LARS algorithm can be applied to obtain such a linear combination. This also implies that a LARS-type variable selection scheme can be conducted even when no binary-scale gold standard is available. When the joint distribution of variables is unknown, the proposed measure is approximated using a nonparametric kernel density estimation method. In this case, we propose a TGDM-based algorithm to calculate the best linear combination of variables. Based on numerical results, we found that our method is numerically stable and has a computational advantage when a large number of variables is considered and a combination of variables is of interest. Moreover, our method can be easily extended to an ordinal-scale gold standard with a suitable choice of weight function for the cutting points, which will be reported elsewhere.

Appendix 

Let random variables $(Y_i, Z_i)$ denote a pair of measures from subject $i$, for $i \ge 1$. Suppose that $\{(y_i, z_i), i = 1, \ldots, n\}$ are $n$ independent observed values of random variables $(Y_i, Z_i)$, $i = 1, \ldots, n$. For a given cutting point $c$, a subject $i$, $i = 1, \ldots, n$, is assigned as a "case" if $z_i > c$ and otherwise labeled as a "control". That is, for a given $c$, we divide the observed subjects into two groups; let $S_1(c)$ and $S_0(c)$ be the case and control groups with sample sizes $n_1$ and $n_0$, respectively.

Then we propose a natural estimate of the AUC index, $AUC_I$, with continuous gold standard,

$$A_I = \int A(t)\,dF_c(t), \qquad (21)$$

where $A(c)$ is defined as

$$A(c) = \frac{1}{n_0 n_1} \sum_{i \in S_1(c),\, j \in S_0(c)} \psi(y_i - y_j),$$

$\psi(u) = 1$ if $u > 0$; $= 0.5$ if $u = 0$; and $= 0$ if $u < 0$, and $F_c(t)$ is the empirical estimate of the cumulative distribution function of $Z$ based on $\{z_1, \ldots, z_n\}$. However, in practice, it is rare to choose cutting points at ranges near the two ends of the distribution of $Z$. Thus, instead of the whole range of $Z$, we might explicitly define a weight function $f_c(t)$ on a particular critical range.
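The empirical estimate in (21) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' implementation: the empirical distribution of $Z$ is taken to put mass $1/n$ on each observed $z_i$, and cut points that leave the case or control group empty (where $A(c)$ is undefined) are skipped and the remaining weights renormalized. The function names `psi`, `auc_at_cut` and `a_index` are hypothetical.

```python
import numpy as np

def psi(u):
    """Step kernel: 1 if u > 0, 0.5 if u == 0, 0 if u < 0."""
    return np.where(u > 0, 1.0, np.where(u == 0, 0.5, 0.0))

def auc_at_cut(y, z, c):
    """A(c): empirical AUC of marker y after dichotomizing z at c."""
    cases, controls = y[z > c], y[z <= c]
    if len(cases) == 0 or len(controls) == 0:
        return np.nan  # A(c) is undefined when one group is empty
    # Average psi(y_i - y_j) over all case/control pairs: 1/(n0*n1) * sum
    return psi(cases[:, None] - controls[None, :]).mean()

def a_index(y, z):
    """A_I: average of A(t) over the observed z values.

    Assumption: undefined cut points are skipped and the rest reweighted.
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    values = np.array([auc_at_cut(y, z, c) for c in z])
    return np.nanmean(values)

if __name__ == "__main__":
    z = np.array([1.0, 2.0, 3.0, 4.0])
    print(a_index(z, z))        # marker ordered exactly like the gold standard
    print(a_index(-z, z))       # marker ordered in reverse
```

A marker that orders subjects exactly as the gold standard does gives the value 1 at every admissible cut point, and a constant marker gives 0.5, which matches the usual reading of an AUC-type index.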

Since the step function in (21) is not continuously differentiable, a smooth estimate of $AUC_I$ is defined as

$$A_{Is} = \int \frac{1}{n_0 n_1} \sum_{i \in S_1(t),\, j \in S_0(t)} S\left(\frac{y_i - y_j}{h}\right) f_c(t)\,dt, \qquad (22)$$

where $S(t)$ is the sigmoid function $1/(1 + \exp(-t))$ and $h$ is the window width.
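The smoothed estimate in (22) replaces the step function with a sigmoid, which can be sketched as below. This Python sketch makes several assumptions that are not taken from the paper: the integral against $f_c(t)$ is approximated by a weighted sum over the distinct observed values of $z$ below the maximum, with equal weights unless `weights` is supplied, and the names `sigmoid` and `smoothed_a_index` are hypothetical.

```python
import numpy as np

def sigmoid(t):
    """S(t) = 1 / (1 + exp(-t)), written via tanh to avoid overflow."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def smoothed_a_index(y, z, h, weights=None):
    """Smoothed AUC-type index: sigmoid kernel with window width h.

    Assumption: f_c(t) dt is approximated by discrete weights on the
    distinct z values below max(z); equal weights are used by default.
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    cuts = np.unique(z)[:-1]  # every such cut leaves both groups non-empty
    if len(cuts) == 0:
        raise ValueError("z must contain at least two distinct values")
    if weights is None:
        weights = np.full(len(cuts), 1.0 / len(cuts))
    total = 0.0
    for c, w in zip(cuts, weights):
        cases, controls = y[z > c], y[z <= c]
        diffs = (cases[:, None] - controls[None, :]) / h
        total += w * sigmoid(diffs).mean()  # 1/(n0*n1) * sum of S(.)
    return total

if __name__ == "__main__":
    z = np.array([1.0, 2.0, 3.0, 4.0])
    for h in (0.01, 1.0, 100.0):
        print(h, smoothed_a_index(z, z, h))
```

As $h$ shrinks, the sigmoid approaches the step function $\psi$, so the smoothed value approaches the unsmoothed one; a very large $h$ flattens every term toward 0.5.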

Appendix A: Proof of Strong Consistency of $A_{Is}(l)$

The proof of the strong consistency of the smoothed $AUC_I(l)$ estimator $A_{Is}(l)$ follows from the following three lemmas.

Lemma 5.1. Suppose that $X_1, \ldots, X_n$ is a sequence of independent and identically distributed random variables with values in $R^1$ and a uniformly continuous density $f(\cdot)$. Let $k(x)$ be a bounded probability density, and suppose that the Dirichlet series $\sum_{n=1}^{\infty} \exp(-\gamma \eta_n)$, $\eta_n = n h^2$, converges for any $\gamma > 0$. Then

$$\int_{-\infty}^{\infty} |f_n(x) - f(x)|\,dx \to 0, \quad \text{almost surely as } n \to \infty,$$

where $f_n(x) = \frac{1}{nh} \sum_{i=1}^{n} k((x - X_i)/h)$ is a kernel density estimator of $f(x)$.

(The proof of Lemma 5.1 can be found in Nadaraya (1989), Theorem 3.1, page 55, so it is omitted here.)

Lemma 5.2. Suppose that $X_1, \ldots, X_n$ is a sequence of independent and identically distributed random variables with values in $R^1$ and a uniformly continuous density. Then with probability one, as $n \to \infty$,

$$\sup_x |F_n(x) - F(x)| \to 0,$$

where $F_n(\cdot)$ and $F(\cdot)$ are the empirical distribution function and the distribution function of $X$, respectively.

Proof of Lemma 5.2

From Nadaraya (1989) (Equation (1.4), page 43), we have

$$\mathrm{pr}\left(\sup_x |F_n(x) - F(x)| > r n^{-1/2}\right) \le c \exp(-2r^2), \qquad (23)$$

which completes the proof of Lemma 5.2.
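The step from the exponential bound (23) to almost sure convergence can be spelled out with a standard Borel-Cantelli argument. The derivation below is a sketch under the assumption that (23) holds for every $r > 0$ with threshold $r n^{-1/2}$ and a constant $c$ not depending on $n$; the symbol $\varepsilon$ is introduced here for illustration.

```latex
% Take r = \varepsilon \sqrt{n} in (23), for an arbitrary fixed \varepsilon > 0:
\mathrm{pr}\Big(\sup_x |F_n(x) - F(x)| > \varepsilon\Big)
  \le c \exp(-2 n \varepsilon^2).
% The right-hand side is summable in n (a geometric series):
\sum_{n=1}^{\infty} c \exp(-2 n \varepsilon^2)
  = \frac{c \exp(-2\varepsilon^2)}{1 - \exp(-2\varepsilon^2)} < \infty.
% By the Borel-Cantelli lemma, the event occurs only finitely often:
\mathrm{pr}\Big(\sup_x |F_n(x) - F(x)| > \varepsilon
  \ \text{infinitely often}\Big) = 0.
```

Since $\varepsilon > 0$ is arbitrary, this gives $\sup_x |F_n(x) - F(x)| \to 0$ with probability one.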

Lemma 5.3. Assume that $\{(y_1, z_1), \ldots, (y_n, z_n)\}$ are $n$ independent and identically distributed samples of $(Y \in R^1, Z \in R^1)$, where $Z$ denotes a continuous gold standard. For a given $c$, let $f(y \mid Z > c)$ be the conditional density function of $Y$ given $Z > c$. Suppose that the conditions of Theorem 3 hold. Then $f(\cdot \mid Z > c)$ is uniformly continuous.

Proof of Lemma 5.3

By Bayes' theorem, we have

$$f(y \mid Z > c) = \frac{\int_c^{\infty} f(y, z)\,dz}{\mathrm{pr}(Z > c)}. \qquad (24)$$

For any $y_i \in R^1$, $i = 1, 2$,

$$\begin{aligned}
\int_c^{\infty} f(y_1, z)\,dz - \int_c^{\infty} f(y_2, z)\,dz
&= \int_c^{\infty} [f(z \mid y_1) f_Y(y_1) - f(z \mid y_1) f_Y(y_2)]\,dz + \int_c^{\infty} [f(z \mid y_1) f_Y(y_2) - f(z \mid y_2) f_Y(y_2)]\,dz \\
&= [f_Y(y_1) - f_Y(y_2)][1 - F(c \mid y_1)] + [F(c \mid y_2) - F(c \mid y_1)] f_Y(y_2), \qquad (25)
\end{aligned}$$

where $f(z \mid y)$ is the conditional density function of $Z$ given $Y = y$ and $f_Y(y)$ is the density function of marker $Y$. From the conditions of Theorem 3, we have $b = \mathrm{pr}(Z > c) > 0$, $f_Y(\cdot) \le M$, and both $f_Y(\cdot)$ and $F(z \mid \cdot)$ are uniformly continuous. Hence, for any $\epsilon > 0$, there exists a $\delta > 0$ such that for any $y_1$ and $y_2$ satisfying $|y_1 - y_2| < \delta$, we have

$$|f_Y(y_1) - f_Y(y_2)| < b\epsilon/2, \qquad |F(c \mid y_2) - F(c \mid y_1)| < b\epsilon/(2M). \qquad (26)$$

Consequently, by (25) and (26) we get that, for a given $c$,

$$\begin{aligned}
|f(y_1 \mid Z > c) - f(y_2 \mid Z > c)|
&\le \frac{1}{b}\left\{ |f_Y(y_1) - f_Y(y_2)|\,(1 - F(c \mid y_1)) + |F(c \mid y_2) - F(c \mid y_1)|\, f_Y(y_2) \right\} \\
&< \epsilon/2 + \epsilon/2 = \epsilon. \qquad (27)
\end{aligned}$$

It follows that $f(\cdot \mid Z > c)$ is uniformly continuous.

Proof of Theorem 3:

By the triangle inequality, we have, for fixed $l$,

$$|A_{Is} - AUC_I| \le |A_{Is} - A_I| + |A_I - AUC_I| = (I) + (II) \ \text{(say)}. \qquad (28)$$

From Theorem 1, (II) converges to 0 almost surely as $n$ goes to $\infty$; that is,

$$A_I - AUC_I \to 0, \quad \text{almost surely as } n \to \infty. \qquad (29)$$
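From (21) and (22),

$$\begin{aligned}
(I) &= \left| \int \frac{1}{n_0 n_1} \sum_{i \in S_1(t),\, j \in S_0(t)} S\left(\frac{y_i - y_j}{h}\right) f_c(t)\,dt - \int \frac{1}{n_0 n_1} \sum_{i \in S_1(t),\, j \in S_0(t)} \psi(y_i - y_j) f_c(t)\,dt \right| \\
&\le \int \frac{1}{n_0 n_1} \sum_{i \in S_1(t),\, j \in S_0(t)} \left| S\left(\frac{y_i - y_j}{h}\right) - \psi(y_i - y_j) \right| f_c(t)\,dt.
\end{aligned}$$

Since $n_1 + n_0 = n$, at least one of $n_1 \to \infty$ and $n_0 \to \infty$ holds as $n$ tends to $\infty$. Without loss of generality, assume that $n_1$ tends to $\infty$. Then

$$\begin{aligned}
(I) \le{} & \int \frac{1}{n_0} \sum_{j \in S_0(t)} \left| \frac{1}{n_1} \sum_{i \in S_1(t)} S\left(\frac{y_i - y_j}{h}\right) - F(y_j \mid Z > t) \right| f_c(t)\,dt \\
& + \int \frac{1}{n_0} \sum_{j \in S_0(t)} \left| \frac{1}{n_1} \sum_{i \in S_1(t)} \psi(y_i - y_j) - F(y_j \mid Z > t) \right| f_c(t)\,dt, \qquad (30)
\end{aligned}$$

where $F(\cdot \mid Z > t)$ is the conditional cumulative distribution function of $Y$ given $\{Z > t\}$. Let $f(\cdot \mid Z > t)$ be its conditional density function. By Lemma 5.3, $f(\cdot \mid Z > t)$ is uniformly continuous.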



Let $h = n^{-\alpha}$, $1/5 < \alpha < 1/2$. Set $\eta_n = n h^2 = n^{1 - 2\alpha}$; then the Dirichlet series $\sum_{n=1}^{\infty} \exp(-\gamma \eta_n)$ converges for any $\gamma > 0$. Thus, the conditions of Lemma 5.1 are satisfied. Let $k(t)$ denote the derivative of $S(t)$; then $k(t)$ is a bounded probability density. Thus, by Lemma 5.1,



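$$\sup_{y \in R^1} \left| \frac{1}{n_1} \sum_{i \in S_1(t)} S\left(\frac{y_i - y}{h}\right) - F(y \mid Z > t) \right| \to 0, \quad \text{almost surely as } n \to \infty. \qquad (31)$$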



From Lemma 5.2 we have

$$\sup_{y} \left| \frac{1}{n_1} \sum_{i \in S_1(t)} \psi(y_i - y) - F(y \mid Z > t) \right| \to 0, \quad \text{almost surely as } n \to \infty. \qquad (32)$$

From (30), (31) and (32), we prove that

$$A_{Is} - A_I \to 0, \quad \text{almost surely as } n \to \infty. \qquad (33)$$

Put (29) and (33) together to complete the proof of Theorem 3.